431 Class 02

Thomas E. Love, Ph.D.

2023-08-31

Instructions for the Quick Survey

Please read these instructions before writing.

  1. Introduce yourself to someone that you don’t know.
  2. Record the survey answers for that other person, while they record your responses.
  3. Be sure to complete all 15 questions (both sides).
  4. When you are finished, thank your partner and raise your hand. Someone will come to collect your survey.

Regarding Question 4, Professor Love is the large fellow standing in the front of the room.

Today’s Agenda

  • Data Structures and Variables
    • Evaluating some of the Quick Survey variables
  • Looking at some of the data collected in Class 01
    • Guessing Dr. Love’s Age (twice)
    • Group Guessing of Ages from 10 Photographs
  • What to work on this weekend

Chatfield’s Six Rules for Data Analysis

  1. Do not attempt to analyze the data until you understand what is being measured and why.
  2. Find out how the data were collected.
  3. Look at the structure of the data.
  4. Carefully examine the data in an exploratory way, before attempting a more sophisticated analysis.
  5. Use your common sense at all times.
  6. Report the results in a clear, self-explanatory way.

Our Quick Survey

Types of Data

The key distinction we’ll make is between

  • quantitative (numerical) and
  • categorical (qualitative) information.

Information that is quantitative describes a quantity.

  • All quantitative variables have units of measurement.
  • Quantitative variables are recorded in numbers, and we use them as numbers (for instance, taking a mean of the variable makes some sense).

Continuous vs. Discrete Quantities

Continuous variables (can take any value in a range) vs. Discrete variables (limited set of potential values)

  • Is Height a continuous or a discrete variable?
  • Height is certainly continuous as a concept, but how precise is our ruler?
  • Piano vs. Violin
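
One way to see the "how precise is our ruler?" point is a minimal sketch in R. The heights below are made up for illustration (they are not survey data): the underlying quantity is continuous, but once we record to the nearest inch, only whole-number values can appear.

```r
# Hypothetical "true" heights in inches (illustrative values, not survey data)
true_height <- c(64.37, 68.92, 70.08, 70.41, 72.66)

# A ruler that reads to the nearest inch records only whole numbers
recorded <- round(true_height)
recorded

# Two different true heights can collapse to the same recorded value
length(unique(true_height))
length(unique(recorded))
```

The recorded variable behaves like a discrete one even though height itself is continuous.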

Quantitative Variable Subtypes

We can also distinguish interval variables (equal distance between values, but the zero point is arbitrary) from ratio variables (meaningful zero point).

  • Is Weight an interval or ratio variable?
  • How about IQ?
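
A standard illustration of the interval/ratio distinction (not taken from the survey) is temperature: Celsius has an arbitrary zero, Kelvin has a meaningful one. The sketch below assumes two example temperatures of 10 and 20 degrees Celsius and shows that differences agree across the two scales, while ratios do not.

```r
# Two example temperatures (assumed values, for illustration only)
celsius <- c(10, 20)
kelvin  <- celsius + 273.15   # Kelvin has a true zero point

# Differences are the same on both scales (equal distances between values)
diff_c <- diff(celsius)
diff_k <- diff(kelvin)

# Ratios depend on where zero is, so "twice as hot" only makes sense in Kelvin
ratio_c <- celsius[2] / celsius[1]
ratio_k <- kelvin[2] / kelvin[1]

c(diff_c = diff_c, diff_k = diff_k, ratio_c = ratio_c, ratio_k = ratio_k)
```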

Qualitative (Categorical) Data

Qualitative variables consist of names of categories.

  • Each possible value is a code for a category (could use numerical or non-numerical codes).
    • Binary categorical variables (two categories, often labeled 1 or 0)
    • Multi-categorical variables (three or more categories)
  • Can distinguish nominal (no underlying order) vs. ordinal (categories are ordered).

Some Categorical Variables

  • How is your overall health?
    (Excellent, Very Good, Good, Fair, Poor)
  • Which candidate would you vote for if the election were held today?
  • Did this patient receive this procedure?
  • If you needed to analyze a small data set right away, which of the following software tools would you be comfortable using to accomplish that task?
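
As a rough sketch of how R can store these kinds of variables, the code below uses made-up responses to the overall health item and to a yes/no procedure item. The response values are hypothetical; only the five health categories come from the slide.

```r
# Hypothetical answers to "How is your overall health?"
health <- c("Good", "Excellent", "Fair", "Very Good", "Good", "Poor")

# Ordinal: the categories have a natural order, so declare it explicitly
health_f <- factor(health,
                   levels = c("Poor", "Fair", "Good", "Very Good", "Excellent"),
                   ordered = TRUE)

table(health_f)       # counts appear in category order, not alphabetical order
health_f >= "Good"    # comparisons are allowed for ordered factors

# Binary: a yes/no item stored with 1/0 codes (hypothetical data)
procedure <- c(1, 0, 0, 1, 1, 0)
mean(procedure)       # with 1/0 codes, the mean is the proportion of 1s
```

A nominal variable (such as preferred candidate) would use factor() without ordered = TRUE, since no ordering of the categories is implied.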

Are these quantitative or categorical?

  1. Do you smoke? (1 = Non-smoker, 2 = Former smoker, 3 = Smoker)
  2. How much did you pay for your most recent haircut? (in $)
  3. What is your favorite color?
  4. How many hours did you sleep last night?
  5. Statistical thinking in your future career? (1 = Not at all important to 7 = Extremely important)
  • If quantitative, are they discrete or continuous? Do they have a meaningful zero point?
  • If categorical, how many categories? Nominal or ordinal?

Importing and Tidying Data

Ingesting the Quick Surveys

The Quick Survey

315 people took (essentially) the same survey in the same way.

Fall   2019  2018  2017  2016  2015  2014  Total
n        61    51    48    64    49    42    315

Question

About how many of those 315 surveys caused no problems in recording responses?

The 15 Survey Items

#    Topic          #    Topic
Q1   glasses        Q9   lectures_vs_activities
Q2   english        Q10  projects_alone
Q3   stats_so_far   Q11  height
Q4   guess_TL_ht    Q12  hand_span
Q5   smoke          Q13  color
Q6   handedness     Q14  sleep
Q7   stats_future   Q15  pulse_rate
Q8   haircut        -    -
  • At one time, I asked about sex rather than glasses.
  • In prior years, people guessed my age rather than my height in Q4.
  • Sometimes, I’ve asked for a 30-second pulse check, then doubled the count.

Response to the Question I asked

About how many of those 315 surveys caused no problems in recording responses?

  • Guesses?
  • 110/315 (35%) caused no problems.
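
The arithmetic behind these figures can be checked in a couple of lines. This is only a sketch using the yearly counts from the table above and the 110 problem-free surveys reported here.

```r
# Survey counts by year, copied from the table of respondents
n_by_year <- c(`2019` = 61, `2018` = 51, `2017` = 48,
               `2016` = 64, `2015` = 49, `2014` = 42)

total <- sum(n_by_year)        # should reproduce the stated total of 315
no_problems <- 110             # surveys that caused no recording problems

pct <- round(100 * no_problems / total)   # percentage, to the nearest whole number
c(total = total, pct = pct)
```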

Guess My Age

What should we do in these cases?

English best language?

Height

Handedness Scale (2016-21 version)

Favorite color

Following the Rules?

2019 pulse responses, sorted (n = 61, 1 NA)

 33 46 48  56  60  60
 62 63 65  65  66  66
 68 68 68  69  70  70
 70 70 70  70  70  70
 71 72 72  74  74  74
 74 74 75  76  76  76
 78 78 78  80  80  80
 80 81 82  84  84  85
 86 86 88  90  90  90
 90 94 96 104 104 110

Stem-and-leaf display of the same 60 values:

  3 | 3
  4 | 68
  5 | 6
  6 | 002355668889
  7 | 00000000122444445666888
  8 | 000012445668
  9 | 000046
 10 | 44
 11 | 0
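
For anyone who wants to rebuild a display like this, here is a minimal sketch using base R's stem() function on the 60 sorted values listed above. Whether the slide's display was produced this way is an assumption, and stem() may choose a different layout depending on its scale argument.

```r
# The 60 non-missing 2019 pulse responses, copied from the sorted listing
pulse <- c(33, 46, 48, 56, 60, 60,
           62, 63, 65, 65, 66, 66,
           68, 68, 68, 69, 70, 70,
           70, 70, 70, 70, 70, 70,
           71, 72, 72, 74, 74, 74,
           74, 74, 75, 76, 76, 76,
           78, 78, 78, 80, 80, 80,
           80, 81, 82, 84, 84, 85,
           86, 86, 88, 90, 90, 90,
           90, 94, 96, 104, 104, 110)

length(pulse)    # number of non-missing responses
range(pulse)     # smallest and largest pulse rates
median(pulse)    # middle of the sorted values

stem(pulse)      # base R stem-and-leaf display
```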

Stem and Leaf: Pulse Rates 2014-2019

(Thanks, John Tukey)

Garbage in, garbage out …

Guessing My Age (Twice)
from Class 01

The R Packages I’ll Load Today

library(janitor)
library(tidyverse)

knitr::opts_chunk$set(comment = NA)
  • If you actually run this in R, you will get some messages which we will suppress and ignore today.

From our 431-Data Page: A .csv file

I’ve placed class01_age_guesses_2022-2023.csv on our 431-data page. This includes guesses from 2022 and 2023.

Creating the age_guess Tibble

Clicking on RAW in the 431-data presentation takes us to a (long) URL that contains the raw data in this sheet.

I’ll read in the sheet’s data to a new tibble (a special kind of R data frame) called age_guess using the read_csv() function.

url_age <- 
  "https://raw.githubusercontent.com/THOMASELOVE/431-data/main/data-and-code/class01_age_guesses_2022-2023.csv"

age_guess <- read_csv(url_age, show_col_types = FALSE)
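
To see what read_csv() does without going to the web, here is a small sketch that reads three rows typed inline. The column names and values mirror three rows of the age_guess data; representing the missing second guess as an empty field is an assumption about the file, and the name mini is just for illustration. It assumes readr 2.0 or later, where I() marks a string as literal data.

```r
library(readr)

# Three rows typed inline; the empty guess2 field stands in for a missing value
csv_text <- "student,guess1,guess2,actual,year
S-2022-04,48,56,55.5,2022
S-2022-05,61,,55.5,2022
S-2022-06,63,63,55.5,2022
"

# I() tells read_csv() to treat the string as data rather than a file path
mini <- read_csv(I(csv_text), show_col_types = FALSE)

mini
is.data.frame(mini)    # a tibble is a special kind of data frame
is.na(mini$guess2)     # the empty field was read as NA
```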

The age_guess tibble

What do we get?

age_guess
# A tibble: 92 × 5
   student   guess1 guess2 actual  year
   <chr>      <dbl>  <dbl>  <dbl> <dbl>
 1 S-2022-01     57     62   55.5  2022
 2 S-2022-02     53     53   55.5  2022
 3 S-2022-03     50     50   55.5  2022
 4 S-2022-04     48     56   55.5  2022
 5 S-2022-05     61     NA   55.5  2022
 6 S-2022-06     63     63   55.5  2022
 7 S-2022-07     67     58   55.5  2022
 8 S-2022-08     50     57   55.5  2022
 9 S-2022-09     50     50   55.5  2022
10 S-2022-10     43     56   55.5  2022
# ℹ 82 more rows

How many guesses in each year?

age_guess |> count(year)
# A tibble: 2 × 2
   year     n
  <dbl> <int>
1  2022    53
2  2023    39

How many first guesses in each year were less than 56?

age_guess |> count(year, guess1 < 56)
# A tibble: 4 × 3
   year `guess1 < 56`     n
  <dbl> <lgl>         <int>
1  2022 FALSE            19
2  2022 TRUE             34
3  2023 FALSE            17
4  2023 TRUE             22
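
Counts are easier to compare across years as percentages. The sketch below simply reuses the counts printed above; it does not touch the original data.

```r
# Counts of first guesses below 56, and total guesses, copied from the output above
below_56 <- c(`2022` = 34, `2023` = 22)
n_year   <- c(`2022` = 53, `2023` = 39)

# Percentage of first guesses under 56 in each year, to one decimal place
pct_below <- round(100 * below_56 / n_year, 1)
pct_below
```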

What do the guess1 values look like?

age_guess |> 
  select(guess1) |> 
  arrange(guess1) 
# A tibble: 92 × 1
   guess1
    <dbl>
 1     40
 2     40
 3     42
 4     42
 5     43
 6     44
 7     45
 8     45
 9     45
10     45
# ℹ 82 more rows

Plot the guess1 values?

ggplot(data = age_guess, 
       aes(x = guess1)) +
  geom_dotplot(binwidth = 1)

Can we make a histogram?

ggplot(age_guess, 
       aes(x = guess1)) +
  geom_histogram()

Improving the Histogram, 1

ggplot(age_guess, 
       aes(x = guess1)) +
  geom_histogram(bins = 10) 

Improving the Histogram, 2

ggplot(age_guess, 
       aes(x = guess1)) +
  geom_histogram(bins = 10, 
        col = "yellow")

Improving the Histogram, 3

ggplot(age_guess, 
       aes(x = guess1)) +
  geom_histogram(bins = 10, 
       col = "white", 
       fill = "blue")

Improving the Histogram, 4

Change theme, specify bin width rather than number of bins

ggplot(age_guess, 
       aes(x = guess1)) +
  geom_histogram(binwidth = 2, 
       col = "white", fill = "blue") +
  theme_bw()
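
As a rough guide to how bin width and number of bins relate, the sketch below uses the smallest (40) and largest (72) first guesses reported in the numerical summary later in these slides. ggplot2's exact bin count also depends on how it aligns the bin edges, so treat these as approximations.

```r
# Smallest and largest first guesses, taken from the numerical summary
guess_min <- 40
guess_max <- 72
spread <- guess_max - guess_min

# With binwidth = 2, about this many intervals are needed to span the data
n_intervals <- spread / 2
n_intervals

# With a target of 10 bins, each bin would be roughly this wide
width_10 <- spread / 10
width_10
```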

Improving the Histogram, 5

ggplot(age_guess, 
       aes(x = guess1)) +
  geom_histogram(binwidth = 2, 
       col = "white", fill = "blue") +
  theme_bw() +
  labs(
    x = "First Guess of Dr. Love's Age",
    y = "Fall 2022 and 2023 431 students")

Add title and subtitle (ver. 6)

ggplot(age_guess, 
       aes(x = guess1)) +
  geom_histogram(binwidth = 2, 
       col = "white", fill = "blue") +
  theme_bw() +
  labs(
    x = "First Guess of Dr. Love's Age",
    y = "Fall 2022 and 2023 431 students",
    title = "Pretty wide range of guesses",
    subtitle = "Dr. Love's Actual Age = 55.5 in 2022, 56.5 in 2023")

Improving the Histogram, 7

Add a vertical line at 56 years to show my actual age.

ggplot(age_guess, 
       aes(x = guess1)) +
  geom_histogram(binwidth = 2, 
       col = "white", fill = "blue") +
  geom_vline(aes(xintercept = 56), col = "red") +
  theme_bw() +
  labs(
    x = "First Guess of Dr. Love's Age",
    y = "Fall 2022 and 2023 431 students",
    title = "Pretty wide range of guesses",
    subtitle = "Dr. Love's Actual Age = 55.5 in 2022, 56.5 in 2023")

In which year did I look older?

Create two facets, one for 2022 and one for 2023 guesses…

ggplot(age_guess, 
       aes(x = guess1, fill = factor(year))) +
  geom_histogram(binwidth = 2, col = "white") +
  theme_bw() +
  facet_grid(year ~ .) +
  labs(
    x = "First Guess of Dr. Love's Age",
    y = "# of Students",
    title = "Distribution of guesses is a bit older in 2023",
    subtitle = "Dr. Love's Actual Age = 55.5 in 2022, 56.5 in 2023")

Numerical Summary

age_guess |> select(student, guess1, guess2, year) |> summary()
   student              guess1          guess2           year     
 Length:92          Min.   :40.00   Min.   :40.00   Min.   :2022  
 Class :character   1st Qu.:50.00   1st Qu.:52.00   1st Qu.:2022  
 Mode  :character   Median :55.00   Median :56.00   Median :2022  
                    Mean   :53.67   Mean   :55.12   Mean   :2022  
                    3rd Qu.:58.00   3rd Qu.:58.00   3rd Qu.:2023  
                    Max.   :72.00   Max.   :70.00   Max.   :2023  
                                    NA's   :3                     
  • Was the average guess closer on guess 1 or 2?
  • What was the range of first guesses? Second guesses?
  • What does the NA's : 3 mean in guess2?
  • Why is student not summarized any further?
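
A small sketch of how R treats a missing value, using the first ten guess2 values from the age_guess listing (one of which is NA). The vector name is made up for illustration.

```r
# First ten second guesses from the age_guess listing; one student gave none
guess2_first10 <- c(62, 53, 50, 56, NA, 63, 58, 57, 50, 56)

sum(is.na(guess2_first10))          # how many values are missing

mean(guess2_first10)                # NA: one missing value makes the mean missing
mean(guess2_first10, na.rm = TRUE)  # drop the NA first, then average the other nine
```

summary() does the second calculation automatically and reports the number of dropped values in its NA's row.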

More Numerical Summaries

  • Using the favstats function from the mosaic package
mosaic::favstats(~ guess1, data = age_guess)
 min Q1 median Q3 max     mean       sd  n missing
  40 50     55 58  72 53.67391 6.200157 92       0
mosaic::favstats(~ guess2, data = age_guess)
 min Q1 median Q3 max    mean       sd  n missing
  40 52     56 58  70 55.1236 5.359402 89       3
  • Using the describe function from the psych package
age_guess |>
  select(guess1, guess2) |>
  psych::describe()
       vars  n  mean   sd median trimmed  mad min max range  skew kurtosis   se
guess1    1 92 53.67 6.20     55   53.58 5.93  40  72    32  0.12     0.04 0.65
guess2    2 89 55.12 5.36     56   55.18 4.45  40  70    30 -0.04     0.25 0.57

How did guesses change in 2023?

  • Did your guesses decrease / stay the same / increase?
  • Calculate guess2 - guess1 and examine its sign.
age_guess |> 
  filter(year == 2023) |>
  count(sign(guess2 - guess1))
# A tibble: 3 × 2
  `sign(guess2 - guess1)`     n
                    <dbl> <int>
1                      -1    11
2                       0     4
3                       1    24
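
To see what sign() is doing, here is a sketch on the first ten rows of the age_guess listing (all from 2022, so the counts will not match the 2023 table above). The vector names are for illustration only.

```r
# First and second guesses for the first ten students in the listing
guess1 <- c(57, 53, 50, 48, 61, 63, 67, 50, 50, 43)
guess2 <- c(62, 53, 50, 56, NA, 63, 58, 57, 50, 56)

change <- guess2 - guess1     # positive means the second guess was older
direction <- sign(change)     # -1 = decreased, 0 = same, 1 = increased

change
table(direction, useNA = "ifany")   # the student with no second guess shows up as NA
```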

How much did guesses change in 2023?

Create new variable (change = guess2 - guess1)

age_guess <- age_guess |>
  mutate(change = guess2 - guess1)

age_guess |> filter(year == 2023) |> select(change) |> summary()
     change      
 Min.   :-5.000  
 1st Qu.:-1.500  
 Median : 1.000  
 Mean   : 1.744  
 3rd Qu.: 5.000  
 Max.   :11.000  

Histogram of Guess Changes

What will this look like?

ggplot(data = age_guess, aes(x = change)) +
  geom_histogram(binwidth = 1, fill = "royalblue", col = "yellow") + 
  theme_bw() +
  labs(x = "Change from first to second guess",
       y = "Students in 431 for Fall 2022 or 2023",
       title = "Most stayed close to their first guess.")

Guess 1 - Guess 2 Scatterplot

ggplot(data = age_guess, aes(x = guess1, y = guess2)) +
  geom_point() 

Filter to complete cases, and add regression line

temp <- age_guess |>
  filter(complete.cases(guess1, guess2))

ggplot(data = temp, aes(x = guess1, y = guess2)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, col = "purple")

What is that regression line?


Call:
lm(formula = guess2 ~ guess1, data = age_guess)

Coefficients:
(Intercept)       guess1  
    22.6209       0.6067  
  • Note that lm filters to complete cases by default.
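
One way to read those coefficients is to plug in a few first guesses. The sketch below hard-codes the printed intercept and slope (rounded as shown above); the helper name predict_guess2 is hypothetical and not part of the slides.

```r
# Coefficients copied from the lm() output above
intercept <- 22.6209
slope     <- 0.6067

# Hypothetical helper: fitted second guess for a given first guess
predict_guess2 <- function(guess1) intercept + slope * guess1

predict_guess2(c(45, 55, 65))

# First guess at which the fitted line meets the y = x (no change) line
crossing <- intercept / (1 - slope)
crossing
```

Because the slope is less than 1, the line predicts a higher second guess for first guesses below the crossing point (roughly 57.5) and a lower second guess for first guesses above it.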

How about a loess smooth curve?

temp <- age_guess |>
  filter(complete.cases(guess1, guess2))

ggplot(data = temp, aes(x = guess1, y = guess2)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, col = "blue") +
  theme_bw()

Add y = x line (no change in guess)?

temp <- age_guess |>
  filter(complete.cases(guess1, guess2))

ggplot(data = temp, aes(x = guess1, y = guess2)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, col = "blue") +
  geom_abline(intercept = 0, slope = 1, col = "red") +
  theme_bw()

With Better Labels

ggplot(data = temp, aes(x = guess1, y = guess2)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, col = "blue") +
  geom_abline(intercept = 0, slope = 1, col = "red") +
  annotate("text", x = 40, y = 38, label = "y = x", colour = "red") +
  labs(x = "First Guess of Love's Age",
       y = "Second Guess of Love's Age",
       title = "Student Guesses of Dr. Love's Age in 2022 and 2023",
       subtitle = "Love's actual age = 55.5 in 2022, 56.5 in 2023") +
  theme_bw()

Age Guessing from Photos
(9 groups, 10 Photos)

Photos 1-5

  • Data in class01-guess10ages-2023-08-29 Google Sheet on our Shared Drive.

Photos 6-10

Comparing the 2023 Groups

Group                 Correct  Within 2  Within 5  Too Low  Too High
< 0.05                      2         3         6        3         5
Complexity                  2         3         6        5         3
Mini UN                     0         3         6        3         7
Soon to be R Masters        0         3         5        5         5
KAPSY                       0         2         6        4         6
R Amateurs                  0         2         5        4         6
Stats R Fun!                0         2         4        4         6
How Old R You               0         1         6        3         7
Outliers                    0         1         5        4         6
  • So … who wins?
  • What other summaries might be helpful?

Summaries of Errors

Group                 Mean Error  SD (Errors)  Median Error  (Min, Max) Error
KAPSY                       -0.1          7.8           1.5  (-13, 11)
Stats R Fun!                -0.2          6.3           2    (-9, 7)
Outliers                     0.4          6.9           3.5  (-12, 8)
Soon to be R Masters         0.7          6.1           0.5  (-9, 8)
R Amateurs                   1.4          8.6           2    (-14, 12)
Complexity                  -1.7          6.5          -1    (-15, 6)
< 0.05                       1.8          7.4           1    (-12, 12)
Mini UN                      2.4          6.8           2.5  (-10, 14)
How Old R You                2.7          5.6           3.5  (-8, 10)
  • How helpful are these summaries in this setting?

Absolute and Squared Errors

  • AE = Absolute Value of Error = |guess - actual|, MSE = Mean Squared Error
Group                 Mean AE  Range (AE)  Median AE   MSE
Complexity                4.9  (0, 15)           4.5  41.5
Soon to be R Masters      5.1  (1, 9)            5.5  34.1
How Old R You             5.3  (2, 10)           4.5  35.3
Stats R Fun!              5.4  (2, 9)            6    35.8
< 0.05                    5.6  (0, 12)           3.5  52.8
Mini UN                   5.6  (1, 14)           4    47.6
Outliers                  6.0  (2, 12)           5.5  43.6
KAPSY                     6.3  (1, 13)           4.5  54.9
R Amateurs                6.8  (1, 14)           6    69.2
  • So … now who wins?

Importing guesses from 2014-2023

temp_url <- "https://raw.githubusercontent.com/THOMASELOVE/431-data/main/data-and-code/class01-photo-age-history-2023.csv"

photos <- read_csv(temp_url, show_col_types = FALSE)

photos
# A tibble: 110 × 12
    card label       age sex   facing year  mean_guess error abs_error sq_error
   <dbl> <chr>     <dbl> <chr> <chr>  <chr>      <dbl> <dbl>     <dbl>    <dbl>
 1     1 Chong        21 M     R      2023        27.9   6.9       6.9    47.6 
 2     2 Archuleta    64 F     L      2023        53   -11        11     121   
 3     3 Mayfield     28 F     L      2023        30.2   2.2       2.2     4.84
 4     4 Love         14 M     L      2023        16     2         2       4   
 5     5 McGinn       54 F     R      2023        61.6   7.6       7.6    57.8 
 6     6 Chaney       74 M     L      2023        72    -2         2       4   
 7     7 Storm        44 M     R      2023        47.3   3.3       3.3    10.9 
 8     8 Glantz       83 F     L      2023        78.6  -4.4       4.4    19.4 
 9     9 Honey        24 M     L      2023        32.9   8.9       8.9    79.2 
10    10 Lawson       34 F     R      2023        28.8  -5.2       5.2    27.0 
# ℹ 100 more rows
# ℹ 2 more variables: `detailed description` <chr>, jpeg <chr>
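
The error columns in this tibble appear to follow directly from age and mean_guess. As a sketch, the code below recomputes them for the ten 2023 rows shown above and then summarizes them the same way the group tables do. Note that these are errors in the mean guess for each card, not the individual groups' errors, so the summaries will not match any one group's row.

```r
# Actual ages and 2023 mean guesses for the ten cards, copied from the listing
age        <- c(21, 64, 28, 14, 54, 74, 44, 83, 24, 34)
mean_guess <- c(27.9, 53, 30.2, 16, 61.6, 72, 47.3, 78.6, 32.9, 28.8)

error     <- mean_guess - age   # positive = guessed too old
abs_error <- abs(error)         # AE = |guess - actual|
sq_error  <- error^2            # squared error

# Mean error, mean absolute error, and mean squared error across the ten cards
summaries <- c(mean_error = mean(error),
               mean_AE    = mean(abs_error),
               MSE        = mean(sq_error))
round(summaries, 2)
```

Positive and negative errors cancel in the mean error, which is why the absolute and squared versions give a clearer picture of how far off the guesses were.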

2014-2023 Errors

The large black “X”s in the plot show 2023 results.

OK. That’s it for the slides. Back to the Class 02 README.